Investigating Measures for Pairwise Document Similarity
Abstract
The need for more effective similarity measures is growing with the astonishing amount of information being placed online. Most existing similarity measures are defined by empirically derived formulas and cannot easily be extended to new applications. We present a pairwise document similarity measure based on information theory, along with corpus-dependent and corpus-independent applications of this measure. When compared against existing similarity measures over TREC FBIS data, our corpus-dependent information-theoretic similarity measure ranked first.
Similar resources
SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity
Measuring the similarity between documents is an essential operation in text mining, especially document clustering. The traditional method of finding the similarity between documents has always been based on extracting individual words from the documents, and using heuristics to give weights to those features. Standard methods in data mining are then used to find the similarity between documen...
Similarity Measures in Documents Using Association Graphs
In this paper we present a new model, designated as Association Graph, to improve document representation, facilitating the ontological dimension. We explain how to generate and use this kind of graph. Also, we analyze different document similarity measures based on this representation. A classical vector space model was used to evaluate this model and measures, investigating their strengths an...
Investigating the Applicability of Centrality Measures as Indicators of Citation Relationships Between Documents in Relational Information Retrieval: A Preliminary Study
Purpose: This pilot study investigates the correlation between centrality measures and bibliographic coupling, a well-known citation-based document similarity measure. Methodology: Using citation analysis, 40 research articles belonging to four engineering/pure disciplines (Physics, Chemistry, Biology, and Computer) and four Humanities and Social disciplines (Economics, Edu...
Pairwise Document Similarity in Large Collections with MapReduce
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a colle...
A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection
The task of plagiarism detection entails two main steps, suspicious candidate retrieval and pairwise document similarity analysis also called detailed analysis. In this paper we focus on the second subtask. We will report our monolingual plagiarism detection system which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passag...